
5. Set up Invoices Pipeline

Invoices is a fact table in ServiceTitan, whose API documentation is here.

Pipeline Configuration

  1. Execution Schedule: Run once daily at 6 AM Eastern Time.

  2. Data load strategy: Fetch all Invoices records that have been created or modified since yesterday. This is an INCREMENTAL DEDUPE data load strategy.

    Invoices is a fact table with a large volume of daily activity, as invoices are continually created and modified. Over time, this table could grow to millions of rows of data, which makes it infeasible to load the entire invoices table every day. Hence, it is fetched incrementally - only the newly created or modified data is fetched.

    New data is deduplicated against existing data to ensure that no duplicate records are present, and that the most recently updated record replaces any older ones. An incremental dedupe therefore requires two additional pieces of information (a minimal sketch of this logic follows the configuration list below):

    • Unique Key - Identifies a unique record, so the pipeline can decide whether to update or insert the incoming data.
    • Sort Key - Identifies the most recent record for a given unique key, so that only the latest version is retained.
  3. Data schema management: Automatically include any new fields added to the API.

  4. Data storage:

    • Azure Blob - In the Azure Blob, create an Invoices folder, and inside it create a folder for each day's pipeline run.

    • Azure SQL - In the 'servicetitan' schema, create a table called 'invoices' to receive data from the Azure Blob.
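
The sketch below illustrates the INCREMENTAL DEDUPE logic described in step 2. It uses pandas purely for illustration (pandas is not part of the DataStori setup), and the column names id and modifiedOn match the dedupe and sort keys configured later in this guide.

```python
# Illustration of INCREMENTAL DEDUPE: merge newly fetched records into existing
# data, keeping one row per unique key (id) and, where the same id appears more
# than once, retaining the row with the latest sort key (modifiedOn).
import pandas as pd

existing = pd.DataFrame([
    {"id": 101, "total": 250.0, "modifiedOn": "2024-05-01T10:00:00Z"},
    {"id": 102, "total": 900.0, "modifiedOn": "2024-05-01T11:30:00Z"},
])

# Records returned by an incremental run: one updated invoice, one new invoice.
incremental = pd.DataFrame([
    {"id": 102, "total": 950.0, "modifiedOn": "2024-05-02T09:15:00Z"},
    {"id": 103, "total": 120.0, "modifiedOn": "2024-05-02T14:45:00Z"},
])

deduped = (
    pd.concat([existing, incremental])
      .sort_values("modifiedOn")            # sort key: latest version sorts last
      .drop_duplicates("id", keep="last")   # unique key: keep the latest row per id
      .reset_index(drop=True)
)
print(deduped)  # 101 unchanged, 102 replaced by its newer version, 103 added
```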

Pipeline Creation

  1. In the Integrations Tab, click on 'Select Data' in the ServiceTitan to ReportingDB integration.

  2. Click on '+ Add New Dataset'.

  3. Creating the data pipeline is a three-step process.

    1. API Details - Specify the API from which data is to be fetched.

      • Connection/Pipeline Name - Name the dataset, e.g., Invoices.

      • Pagination - DataStori automatically paginates over the APIs. If the API supports pagination, set it to Yes.

      • OperationId - Select the API from which to fetch the data. In this example, it is Invoices_GetList.

      • Variables - These are dynamic inputs to the APIs. For example, date parameters or dynamic strings to define the folder format. Create the following variables:

        • 'yesterday' - provide the value (datetime.datetime.now().date() - datetime.timedelta(days=1)).strftime('%Y-%m-%d'). This variable is passed to the API call; whenever the pipeline runs, it resolves to yesterday's date.

        • 'folder_format' - provide the value (datetime.datetime.now().date() - datetime.timedelta(days=1)).strftime('%Y-%m-%d') to create a date-wise folder in the blob.

        Variables are Python datetime expressions that support a variety of input and output formats. A short sketch showing how the 'yesterday' variable resolves, and how it feeds the API call, follows this step.

      • API Parameters - API parameters are rendered based on the selected API.

        • Populate the ST-App-Key and TenantID. You get these values when setting up API authentication.

        • modifiedOnOrAfter - set it to "{{yesterday}}". We use the date field "modifiedOnOrAfter" in order to fetch all the data created or modified since 'yesterday'. Other date fields can be selected based on the defined business rules for data load.

          tip

          Using a modifiedOnOrAfter or lastModified date is the most reliable way of running pipelines in incremental mode. If you don't have a last-modified column, consider using the Incremental Drop and Load strategy, or write to contact@datastori.io for assistance.
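
      A short sketch of how the 'yesterday' variable resolves and how it could be passed to the Invoices API is shown below. The endpoint URL, bearer token, and page-size values are illustrative assumptions and are not taken from this guide; DataStori constructs the actual request from the OperationId and parameters configured above.

      ```python
      # Hedged sketch: resolve the 'yesterday' variable and call the Invoices API.
      # The URL path, Authorization header, and pageSize below are assumptions for
      # illustration only; DataStori builds the real request from the pipeline config.
      import datetime
      import requests

      # Exactly the expression configured for the 'yesterday' variable:
      yesterday = (datetime.datetime.now().date() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')

      TENANT_ID = "<your-tenant-id>"          # from the API authentication setup
      ST_APP_KEY = "<your-st-app-key>"        # from the API authentication setup
      ACCESS_TOKEN = "<oauth-access-token>"   # assumed: an OAuth bearer token is also required

      url = f"https://api.servicetitan.io/accounting/v2/tenant/{TENANT_ID}/invoices"  # assumed path
      params = {"modifiedOnOrAfter": yesterday, "page": 1, "pageSize": 500}

      response = requests.get(
          url,
          headers={"ST-App-Key": ST_APP_KEY, "Authorization": f"Bearer {ACCESS_TOKEN}"},
          params=params,
          timeout=60,
      )
      response.raise_for_status()
      payload = response.json()
      print(len(payload.get("data", [])), "invoices created or modified on or after", yesterday)
      ```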

    2. Schedule and Data Load Strategy - Define the pipeline run schedule, and how data is to be written to the Azure Blob and SQL table.

      • Select 6 AM Eastern Time or create a new schedule in Settings > Schedules.
      • Set the Dedupe Strategy to INCREMENTAL DEDUPE HISTORY.
      • Dedupe Keys Json Paths - Set it to data[*].id. ServiceTitan provides an ID with every entity. This is the unique key for a record. Dedupe and Sort Key documentation
      • Sort Keys Json Paths - Set it to data[*].modifiedOn. This field helps identify the latest among two records with the same unique ID. Dedupe and Sort Key documentation
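
      The two JSON paths above point into the API response body. The snippet below uses an illustrative, made-up response to show what they extract; the real Invoices payload contains many more fields.

      ```python
      # Illustrative ServiceTitan-style response (field values are made up) showing
      # what the configured JSON paths data[*].id and data[*].modifiedOn point at.
      sample_response = {
          "page": 1,
          "hasMore": False,
          "data": [
              {"id": 101, "total": 250.0, "modifiedOn": "2024-05-01T10:00:00Z"},
              {"id": 102, "total": 950.0, "modifiedOn": "2024-05-02T09:15:00Z"},
          ],
      }

      dedupe_keys = [record["id"] for record in sample_response["data"]]          # data[*].id
      sort_keys = [record["modifiedOn"] for record in sample_response["data"]]    # data[*].modifiedOn
      print(dedupe_keys)  # [101, 102]
      print(sort_keys)    # ['2024-05-01T10:00:00Z', '2024-05-02T09:15:00Z']
      ```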

    3. Destination Details: Specify the Azure Blob and Azure SQL locations where data is to be written.

      {{folder_format}} refers to the variable created in the API Details section. A new folder is created for each day's execution using this variable, and data is written to it.

      • Final deduped data is available in the 'final' folder.
      • Schema changes are tracked under the 'schema' folder.

      Sample output in the Azure Blob is as follows:
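
      The layout below is an illustrative sketch only, assuming the Invoices folder and {{folder_format}} settings configured above and a run on 2024-05-02; the exact structure in your storage account may differ.

      ```text
      invoices/
        2024-05-02/     # one folder per daily run, named via {{folder_format}}
        final/          # final deduped data, loaded into servicetitan.invoices
        schema/         # schema change tracking
      ```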

  4. Save the form and exit the page.

With the above setup, the Invoices pipeline will run on schedule at 6 AM Eastern Time. To run it on demand, go to Integrations > Select Data (for the Invoices pipeline) and click on the Play button under Datasets.

Pipeline Execution

The pipeline executes in the customer's cloud environment. This can be audited from the Azure Container Instances logs.

Pipeline execution status is shown in the dashboard and any failure notifications are sent to the email IDs specified in Settings > Notifications.